The IMDb Scraper uses a dual-source extraction strategy combining HTML parsing and GraphQL API calls to maximize data coverage and reliability.
Architecture Overview
The scraping engine is built around the ImdbScraper class, which implements the ScraperInterface and follows Clean Architecture principles:
class ImdbScraper(ScraperInterface):
    def __init__(
        self,
        use_case: UseCaseInterface,
        proxy_provider: ProxyProviderInterface,
        tor_rotator: TorInterface,
        engine: str,
        base_url: str = config.BASE_URL
    ):
        self.use_case = use_case
        self.proxy_provider = proxy_provider
        self.tor_rotator = tor_rotator
        self.engine = engine
        self.base_url = base_url
Location: infrastructure/scraper/imdb_scraper.py:21
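As a rough illustration of why the dependencies are injected through the constructor, the sketch below swaps in a test double for the use case. `UseCaseLike`, `InMemoryUseCase`, `TinyScraper`, and `handle` are hypothetical names invented for this example; only the `execute(movie)` call mirrors something that appears in the scraper code.

```python
from typing import Any, List, Protocol


class UseCaseLike(Protocol):
    """Hypothetical stand-in for UseCaseInterface: anything with execute(movie)."""

    def execute(self, movie: Any) -> None: ...


class InMemoryUseCase:
    """Test double that stores movies in a list instead of persisting them."""

    def __init__(self) -> None:
        self.saved: List[Any] = []

    def execute(self, movie: Any) -> None:
        self.saved.append(movie)


class TinyScraper:
    """Hypothetical, stripped-down scraper that only forwards to its use case."""

    def __init__(self, use_case: UseCaseLike) -> None:
        self.use_case = use_case

    def handle(self, movie: Any) -> None:
        self.use_case.execute(movie)


store = InMemoryUseCase()
TinyScraper(store).handle({"id": "tt0111161"})
print(store.saved)  # [{'id': 'tt0111161'}]
```

Because the scraper only depends on the interface, the same pattern would let the proxy provider and TOR rotator be replaced with stubs in tests.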
Dual-Source Data Collection
HTML Parsing Strategy
The scraper extracts movie IDs from the IMDb Top 250 chart using CSS selectors:
# Extract movie IDs from HTML
html_ids = [
    a["href"].split("/")[2]
    for a in soup.select("td.titleColumn a")
    if "/title/" in a["href"]
]
Location: infrastructure/scraper/imdb_scraper.py:146
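The `split("/")[2]` approach above relies on every href being a relative path of the form `/title/<id>/`. A minimal alternative sketch, assuming IDs always look like `tt` followed by digits, uses a regular expression instead. `TITLE_ID_RE` and `extract_title_ids` are hypothetical names, not part of the project.

```python
import re
from typing import List

# Assumption: IMDb title IDs are "tt" followed by digits, e.g. "tt0111161".
TITLE_ID_RE = re.compile(r"/title/(tt\d+)")


def extract_title_ids(hrefs: List[str]) -> List[str]:
    """Return the title IDs found in href values, skipping non-title links."""
    ids = []
    for href in hrefs:
        match = TITLE_ID_RE.search(href)
        if match:
            ids.append(match.group(1))
    return ids


hrefs = [
    "/title/tt0111161/?ref_=chttp_t_1",       # relative link with query string
    "/name/nm0000151/",                       # not a title link, skipped
    "https://www.imdb.com/title/tt0068646/",  # absolute link
]
print(extract_title_ids(hrefs))  # ['tt0111161', 'tt0068646']
```

With an absolute URL, `split("/")[2]` would return the host name rather than the ID, which is the case the regular expression guards against.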
GraphQL API Integration
To supplement HTML data, the scraper queries IMDb’s GraphQL endpoint:
def _fetch_graphql_ids(self, cookies: Optional[requests.cookies.RequestsCookieJar]) -> List[str]:
    payload = {
        "operationName": config.GRAPHQL_OPERATION,
        "variables": {
            "first": config.NUM_MOVIES,
            "isInPace": False,
            "locale": config.GRAPHQL_LOCALE
        },
        "extensions": {
            "persistedQuery": {
                "sha256Hash": config.GRAPHQL_HASH,
                "version": config.GRAPHQL_VERSION
            }
        }
    }
    response = make_request(
        url=config.GRAPHQL_URL,
        proxy_provider=self.proxy_provider,
        tor_rotator=self.tor_rotator,
        method="POST",
        json_payload=payload
    )
Location: infrastructure/scraper/imdb_scraper.py:158
GraphQL Configuration:
Endpoint: https://caching.graphql.imdb.com/
Operation: Top250MoviesPagination
Hash: 2db1d515844c69836ea8dc532d5bff27684fdce990c465ebf52d36d185a187b3
Locale: en-US
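Once both sources have returned IDs, they need to be combined without duplicates. The sketch below shows one plausible way to do that while preserving chart order; `merge_ids` is a hypothetical helper, and the project's actual merge logic may differ.

```python
from typing import Iterable, List


def merge_ids(html_ids: Iterable[str], graphql_ids: Iterable[str], limit: int = 250) -> List[str]:
    """Combine two ID lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged: List[str] = []
    for imdb_id in list(html_ids) + list(graphql_ids):
        if imdb_id not in seen:
            seen.add(imdb_id)
            merged.append(imdb_id)
    return merged[:limit]


html_ids = ["tt0111161", "tt0068646"]
graphql_ids = ["tt0068646", "tt0468569"]
print(merge_ids(html_ids, graphql_ids))
# ['tt0111161', 'tt0068646', 'tt0468569']
```

HTML IDs are placed first here so that the GraphQL results only fill in what the page did not provide; reversing the argument order would give GraphQL priority instead.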
BeautifulSoup Selectors
The engine uses CSS selectors configured in shared/config/config.py:
SELECTORS = {
    "title": '[data-testid="hero__primary-text"]',
    "year": 'ul.ipc-inline-list li a[href*="releaseinfo"]',
    "rating": '[data-testid="hero-rating-bar__aggregate-rating__score"] span',
    "duration_container": 'ul.ipc-inline-list--show-dividers',
    "metascore": "span.metacritic-score-box",
    "actors": "a[data-testid='title-cast-item__actor']"
}
# Title extraction
title_tag = soup.select_one(config.SELECTORS.get("title", ""))
title = title_tag.text.strip() if title_tag else ""
# Year extraction with regex validation
year_tag = soup.select_one(config.SELECTORS.get("year", ""))
year_str = year_tag.text.strip("()") if year_tag else "0"
year_match = re.search(r'\d{4}', year_str)
year = int(year_match.group()) if year_match else 0
# Rating extraction
rating_tag = soup.select_one(config.SELECTORS.get("rating", ""))
rating = float(rating_tag.text.strip()) if rating_tag else 0.0
# Metascore extraction (optional field)
metascore_tag = soup.select_one(config.SELECTORS.get("metascore", ""))
metascore = int(metascore_tag.text.strip()) if metascore_tag else None
Location: infrastructure/scraper/imdb_scraper.py:85
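The rating and metascore conversions above call `float()` and `int()` directly, which raise `ValueError` if the tag exists but holds non-numeric text. A defensive variant might look like the sketch below; `parse_rating` and `parse_metascore` are hypothetical helpers that keep the same defaults (`0.0` and `None`) as the code above.

```python
from typing import Optional


def parse_rating(text: Optional[str]) -> float:
    """Convert rating text to float, falling back to 0.0 on missing or bad input."""
    if not text:
        return 0.0
    try:
        return float(text.strip())
    except ValueError:
        return 0.0


def parse_metascore(text: Optional[str]) -> Optional[int]:
    """Convert metascore text to int, returning None when absent or non-numeric."""
    if not text:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


print(parse_rating(" 9.3 "), parse_rating("N/A"))   # 9.3 0.0
print(parse_metascore("82"), parse_metascore("tbd"))  # 82 None
```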
Duration Parsing
The scraper handles IMDb’s varied duration formats (e.g., “2h 30m”, “1h 45m”, “90m”):
duration = None
ul_list = soup.select(config.SELECTORS.get("duration_container", ""))
for ul in ul_list:
    for li in ul.find_all("li"):
        text = li.get_text(strip=True).lower()
        if re.search(r"(\d+h|\d+m)", text):
            hours_match = re.search(r"(\d+)h", text)
            minutes_match = re.search(r"(\d+)m", text)
            h = int(hours_match.group(1)) if hours_match else 0
            m = int(minutes_match.group(1)) if minutes_match else 0
            duration = (h * 60) + m
            break
    if duration:
        break
Location: infrastructure/scraper/imdb_scraper.py:100
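The same conversion can be isolated into a pure function, which makes it easy to check against the formats listed above. This is a minimal sketch using the same regular expressions; `parse_duration` is a hypothetical name, not a function from the project.

```python
import re
from typing import Optional


def parse_duration(text: str) -> Optional[int]:
    """Convert strings like '2h 30m' or '90m' to total minutes, or None if no match."""
    text = text.strip().lower()
    hours_match = re.search(r"(\d+)h", text)
    minutes_match = re.search(r"(\d+)m", text)
    if not hours_match and not minutes_match:
        return None
    h = int(hours_match.group(1)) if hours_match else 0
    m = int(minutes_match.group(1)) if minutes_match else 0
    return h * 60 + m


samples = ["2h 30m", "1h 45m", "90m", "PG-13"]
print([parse_duration(s) for s in samples])  # [150, 105, 90, None]
```

Returning `None` for non-matching text (such as a content rating in the same list) lets the caller keep scanning list items until one parses.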
The scraper extracts the top 3 actors from each movie:
cast_tags = soup.select(config.SELECTORS.get("actors", ""))[:3]
actors = [
    Actor(id=None, name=cast.text.strip())
    for cast in cast_tags if cast.text.strip()
]
Location: infrastructure/scraper/imdb_scraper.py:115
Error Handling & Retry Logic
Robust Request Handling
All HTTP requests go through the make_request utility, which retries failed requests with increasing delays:
response = make_request(
    url=detail_url,
    proxy_provider=self.proxy_provider,
    tor_rotator=self.tor_rotator
)
if not response:
    # Log message (Spanish): "Could not get a response for the URL: {detail_url}"
    logger.warning(f"No se pudo obtener respuesta para la URL: {detail_url}")
    return None
Location: infrastructure/scraper/imdb_scraper.py:71
Retry Configuration
MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]  # Increasing delay before each retry, in seconds
REQUEST_TIMEOUT = 10
BLOCK_CODES = [202, 403, 404, 429, 500]
Location: shared/config/config.py:50
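To show how these settings could fit together, here is a minimal retry loop. It is a sketch under assumptions: `fetch_with_retries` is a hypothetical function, the `fetch` callable returning a `(status, body)` tuple is invented for illustration, and the real `make_request` may count attempts and apply delays differently.

```python
import time
from typing import Callable, List, Optional, Tuple

MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]
BLOCK_CODES = {202, 403, 404, 429, 500}


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, bytes]],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch up to MAX_RETRIES times, waiting longer after each failure."""
    for attempt in range(MAX_RETRIES):
        try:
            status, body = fetch()
            if status not in BLOCK_CODES:
                return body
        except OSError:
            pass  # network-level failure: treat it like a blocked response
        if attempt < MAX_RETRIES - 1:
            sleep(RETRY_DELAYS[attempt])  # no wait after the final attempt
    return None


# Usage with a fake fetcher and a recording sleep, so nothing waits or hits the network.
responses = iter([(429, b""), (200, b"<html>ok</html>")])
waited: List[float] = []
result = fetch_with_retries(lambda: next(responses), sleep=waited.append)
print(result, waited)  # b'<html>ok</html>' [1]
```

Passing `sleep` as a parameter keeps the loop testable without real delays.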
Fallback Strategy
The request utility implements a multi-layer fallback:
Primary: Premium proxy (DataImpulse)
Fallback: TOR network with IP rotation
Final: Direct connection through VPN
Location: infrastructure/scraper/utils.py:34
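The three layers above can be pictured as an ordered list of transports tried in turn. The sketch below uses stub functions in place of real proxy, TOR, and direct connections; `request_with_fallback` and the stubs are hypothetical and do not reflect the actual code in utils.py.

```python
from typing import Callable, List, Optional, Tuple

# A transport takes a URL and returns the response body, or None on failure.
Transport = Callable[[str], Optional[bytes]]


def request_with_fallback(
    url: str, transports: List[Tuple[str, Transport]]
) -> Optional[Tuple[str, bytes]]:
    """Try each transport in order; return (name, body) for the first that succeeds."""
    for name, transport in transports:
        try:
            body = transport(url)
        except Exception:
            body = None  # any failure moves on to the next layer
        if body is not None:
            return name, body
    return None


def failing_proxy(url: str) -> Optional[bytes]:
    raise ConnectionError("proxy refused")  # simulate a blocked premium proxy


def tor_stub(url: str) -> Optional[bytes]:
    return b"via tor"


def direct_stub(url: str) -> Optional[bytes]:
    return b"direct"


layers = [("proxy", failing_proxy), ("tor", tor_stub), ("direct", direct_stub)]
print(request_with_fallback("https://www.imdb.com/chart/top/", layers))
# ('tor', b'via tor')
```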
Data Validation
Each scraped movie is passed to the use case for persistence. Records that fail validation raise ValueError and are skipped rather than saved:
try:
    movie = self._scrape_movie_detail(indexed_id)
    if movie:
        self.use_case.execute(movie)
except ValueError as e:
    # Log message (Spanish): "Invalid data for {imdb_id}: {e}. Skipping save."
    logger.warning(f"Datos inválidos para {imdb_id}: {e}. Saltando guardado.")
except Exception as e:
    # Log message (Spanish): "Unexpected error while processing and saving {imdb_id}: {e}"
    logger.error(f"Error inesperado al procesar y guardar {imdb_id}: {e}", exc_info=True)
Location: infrastructure/scraper/imdb_scraper.py:58
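The source does not show where the `ValueError` originates. One common pattern, shown here purely as an assumption, is to validate inside the entity constructor. `MovieRecord` and its bounds are hypothetical; note that the sentinel defaults used during extraction (empty title, year `0`, rating `0.0`) would be rejected by checks like these.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MovieRecord:
    """Hypothetical entity that rejects obviously incomplete scrape results."""

    imdb_id: str
    title: str
    year: int
    rating: float
    duration_minutes: Optional[int] = None
    metascore: Optional[int] = None

    def __post_init__(self) -> None:
        if not self.title:
            raise ValueError("title is empty")
        if not 1888 <= self.year <= 2100:  # illustrative bounds, not from the project
            raise ValueError(f"year out of range: {self.year}")
        if not 0.0 < self.rating <= 10.0:
            raise ValueError(f"rating out of range: {self.rating}")


valid = MovieRecord("tt0111161", "The Shawshank Redemption", 1994, 9.3)
print(valid.title)  # The Shawshank Redemption

try:
    MovieRecord("tt0000000", "", 0, 0.0)  # extraction defaults for a failed parse
except ValueError as exc:
    print(f"skipped: {exc}")  # skipped: title is empty
```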
Traffic Monitoring
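The scraper accumulates the size of every response body and logs the total when the run completes:
self.total_bytes_used += len(response.content)
# At completion:
# Log message (Spanish): "Total traffic used: ... MB"
logger.info(f"Tráfico total usado: {self.total_bytes_used / (1024 ** 2):.2f} MB")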
Location: infrastructure/scraper/imdb_scraper.py:81
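When many worker threads update the same counter, an unsynchronized `+=` on a shared attribute is not guaranteed to be atomic. Whether the project guards this update is not shown in the source; the sketch below is one way to make the counter thread-safe, with `TrafficCounter` as a hypothetical class.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TrafficCounter:
    """Accumulates byte counts from multiple threads under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._total = 0

    def add(self, num_bytes: int) -> None:
        with self._lock:
            self._total += num_bytes

    @property
    def total(self) -> int:
        with self._lock:
            return self._total


counter = TrafficCounter()
# Simulate 1000 responses of 1 KiB each arriving from 8 worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(counter.add, [1024] * 1000))

print(f"{counter.total / (1024 ** 2):.2f} MB")  # 0.98 MB
```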
Configuration Options
Key configuration options in shared/config/config.py:
# Scraping parameters
BASE_URL = "https://www.imdb.com"
CHART_TOP_PATH = "/chart/top/"
TITLE_DETAIL_PATH = "/title/{id}/"
NUM_MOVIES = 250
# Request settings
REQUEST_TIMEOUT = 10
MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]
# Concurrency
MAX_THREADS = 50
# User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0.4324.96 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5) AppleWebKit/537.36 Chrome/90.0.4430.91 Mobile Safari/537.36"
]
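As a brief illustration of how these values might be consumed, the sketch below builds a detail-page URL from the path template and picks a User-Agent at random for each request. `detail_url` and `random_headers` are hypothetical helpers, and the shortened `USER_AGENTS` list is only for the example.

```python
import random
from typing import Dict

BASE_URL = "https://www.imdb.com"
TITLE_DETAIL_PATH = "/title/{id}/"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0.4324.96 Safari/537.36",
]


def detail_url(imdb_id: str) -> str:
    """Fill the {id} placeholder and prepend the base URL."""
    return BASE_URL + TITLE_DETAIL_PATH.format(id=imdb_id)


def random_headers() -> Dict[str, str]:
    """Pick one configured User-Agent at random for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


print(detail_url("tt0111161"))  # https://www.imdb.com/title/tt0111161/
print(random_headers()["User-Agent"] in USER_AGENTS)  # True
```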
Next Steps
Network Evasion: Learn about the multi-layer proxy and TOR setup.
Concurrency: Explore parallel processing with ThreadPoolExecutor.